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Abstract. 

A kinetic model for the nucleation mechanism of protein folding is proposed. A protein is modeled as a 
hcteropolymer consisting of hydrophobic and hydrophilic beads with equal constant bond lengths and bond 
angles. The total energy of the hcteropolymer is determined by the repulsive/ attractive interactions of non- 
linked beads and the contribution from the dihedral angles involved. Their parameters can be rigorously 
defined, unlike the ill defined surface tension of a cluster of protein residues which is the basis of the previous 
model. As a crucial idea of the model, the dihedral potential in which a selected bead is involved is averaged 
over all possible configurations of neighboring beads along the protein chain. The resulting average dihedral 
potential of the residue is constant far enough from the cluster, but increases monotonically with decreasing 
distance below a threshold value. An overall potential around the cluster wherein a residue performs a 
chaotic motion is a combination of the average dihedral and pairwise potentials. As a function of distance 
from the cluster it has a double well shape. Residues in the inner well are considered as belonging to the 
cluster (folded part of the protein) while those in the outer well are treated as belonging to the unfolded 
(although compact) part of the protein. A double well shape of the potential around the cluster allows 
one to determine its emission and absorption rates by using a first passage time analysis and develop a 
self-consistent kinetic theory for the nucleation mechanism of protein folding. Numerical calculations for 
a protein of 2500 residues with the diffusion coefficient of residues in the native state ranging from f -6 
cm 2 /s to 10~ 8 cm 2 /s predict folding times in the range from several seconds to several hundreds of seconds. 
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1 Introduction 



Proteins play an overwhelmingly dominant role in life. If a specific job has to by done in a living 
organism, it is almost always a protein that does it. Life depends on thousands of different pro- 
teins whose structures are fashioned so that individual protein molecules combine, with exquisite 
precision, with other molecules. In order for a protein molecule to carry out a specific biologi- 
cal function, it has to adopt a well-defined three-dimensional structure. 1 ' 2 The formation of this 
structure (of a biologically active globular protein) constitutes the core of a so-called "protein 
folding problem". 3 Many thermodynamic and kinetic aspects of the process remain obscure and 
its mechanism elusive. 4-8 

Experiment and simulation suggest that there exist multiple pathways for the protein folding. 6- 
It is believed that initially a denatured protein very quickly transforms into a compact (but not 
native) configuration with a few, insignificant amount of tertiary contacts. The transition from 
such a compact configuration to the native one has been suggested to occur via two distinct mech- 
anisms. One of them can be referred to as a "transition state mechanism" whereby the tertiary 
contacts of the native structure form as the protein passes through a sequence of intermediate 
states thus gradually achieving its unique spatial configuration. 6-15 The protein in intermediate 
states has a native-like overall topology but is stabilized by incorrect hydrophobic contacts. These 
states correspond to misfolded forms of the native protein. The transition from these intermedi- 
ate, misfolded states to the correctly folded, native structure is a slow process (because it involves 
a large-scale rearrangement of the molecule) occurring on a relatively large time scales. 14 ' 15 Alter- 
natively, the transition from the compact "amorphous" configuration to the native state occurs 
immediately following the formation of some number of tertiary contacts. 14 ' 15 This mechanism is 
similar to nucleation, i.e., once a critical number of (native) tertiary contacts is established the 
native structure is formed without passing through any detectable intermediate states. 14 

So far most of the work on protein folding has been done by using either Monte Carlo (MC) or 
molecular dynamics (MD) simulations. A rigorous theoretical treatment of the problem by means 
of the statistical mechanics is hardly practicable because of the extreme complexity of the system, 
although some approximate treatments were already reported. 16,17 A theoretical model for the 
nucleation mechanism of the process has so far remained underdeveloped. 18,14,19 The model which 
existed so far is a thermodynamic one considering the formation of a cluster of protein residues 
and calculating its free energy change, much like the classical nucleation theory (CNT) does. The 
cluster is characterized by v, the total number of residues, with the mole fraction of hydrophobic 
ones assumed to be known. As usual in CNT, the size of a critical cluster (nucleus) is provided by 
the location of the maximum of the free energy of formation as a function of v. Such an approach 
necessarily involves the concept of surface tension for a cluster consisting of protein residues. 
(Clearly, this quantity is an intrinsically ill-defined physical quantity and can be considerer only 
as an adjustable parameter; no direct experimental measurement thereof is possible.) After the 
formation of the nucleus (critical size cluster of residues), the protein quickly reaches its native 
state. 

In what follows work I present a new, microscopic model for the nucleation mechanism of the 
protein folding. The new model is based on "molecular" interactions, both long-range (due to 
repulsion/attraction) and configurational (due bond and dihedral angles), in which protein residues 
are involved. Their parameters can be rigorously defined, and it should be possible (although not 
straightforward) to determine them theoretically, computationally, or experimentally. The ill- 
defined surface tension of a cluster of protein residues does not enter into the new model which 
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is thus more advanced than the CNT-based one. The crucial idea underlying the new model 
consists of averaging the dihedral potential in which a selected residue is involved over all the 
possible configurations of neighboring residues. The resulting average dihedral potential depends 
on the distance between residue and cluster. Its combination with the average long range potential 
(due to pairwise interactions of the selected residue with those in the cluster) gives rise to the 
overall potential which has generates a pair of potential wells around the cluster with a barrier 
between them. Residues in the inner well are considered to belong to the cluster (part of the 
protein with correct tertiary contacts) while those in the outer well are treated as belonging to 
the mother phase (amorphous part of the protein with incorrect tertiary contacts). Transitions 
of residues from the inner well into the outer one and vice versa are considered as elementary 
emission and absorption events, respectively. The rates of emission and absorption of residues by 
the cluster are determined by using the first passage time analysis. 20-25 Once these rates are found 
as functions of the cluster size, one can develop a self-consistent kinetic theory for the nucleation 
mechanism of folding of a protein. For example, the size of the critical cluster (nucleus) is then 
found as the one for which these rates are equal. The time necessary for the protein to fold can 
be evaluated as a sum of the times necessary for the appearance of the first nucleus and the time 
necessary for the nucleus to grow to the maximum size (of the folded protein in the native state). 

The paper is structured as follows. In Section 2 we describe a random heteropolymer chain 
whereby a protein molecule is often modeled 6,14 and outline a CNT-based model for the nucleation 
mechanism of protein folding, whereof a new, microscopic model is proposed in Section 3. The 
results of numerical calculations are presented in Section 4 and a brief discussion and conclusions 
are summarized in Section 5. 



2 Heteroplymer chain as a protein model and a CNT based model 
for the nucleation mechanism of protein folding 

2.1 A heteropolymer as a protein model 

As a simple model of a protein in MD and MC simulations of protein folding dynamics, the 
polypeptide chain of a protein is considered 6 ' 14 as a heteropolymer consisting of N connected 
beads which can be thought of as representing the a-carbons of various amino acids. The het- 
eropolymer may consist of hydrophobic (b), hydrophilic (I), or neutral (n). Two adjacent beads 
are connected by a covalent bond of fixed length r/. This model (and its variants), augmented 
with appropriate interaction, bond, and dihedral potentials described below, was shown 6,14,16,18 
to be capable of capturing the essential characteristics of protein folding process even though it 
contains only some features of a real polypeptide chain. For example, this model ignores side 
groups although they are known to be crucial for intramolecular hydrogen bonding. 1 Besides, the 
presence of solvent (water) in a real physical system has been usually accounted for too simplisti- 
cally, although the protein dynamics was reported to be more realistic in MD simulations where 
solvent molecules are explicitly present. 26 Despite these limitations, various modifications of het- 
eropolymer models 5,16,18,2 shed light on some important details of the folding of polypeptide 
chains, such as possible pathways for the protein transition from a denatured state to the native 
one. 6,14 

The total energy of the heteropolymer (polypeptide chain) can contain the contributions of 
three different types. First, the contribution from repulsive/attractive forces between pairs of 
non-adjacent beads (these can be, e.g., of Lennard- Jones type or others). The next contribution 
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can arise from harmonic forces related with the oscillations of bond angles. Finally, there is a 
contribution from the dihedral angle potential due to the rotation around the peptide bonds. 
There are various ways to model these three types of energetic terms. 6 ' 14 ' 27-29 

A pair interaction between two non-adjacent beads i and j at a distance r^- away from each 
other can be taken, for example, as 6,14 

f 4e b [( V /r tj ) 12 - ( V /r tj f] (i,j=b), 
<f>ij(r ij ) = l ^[(v/r^) 12 + (v/r^) 6 ] (i = l,j = b,l), (1) 
[ 4e n (f]/r ij ) 12 (i = n,j = b,l,n), 

where r] is the bond length (fixed) and and e n are the energy parameters. 

The angle (3 between two successive bonds (in the heteropolymer) can be regarded to be 
subjected to a harmonic potential 

0/3 = ^(/3-/?o), (2) 

where the spring constant kp is relatively large (in refs.6,14 it was taken 20e;,/(rad) 2 = 105° so 
that the deviation of the bond angles from the average value (3q is very small. Hence all bond 
angles can be set to be equal to (3q (as argued in refs.6,14, the bond angle forces play a minor role 
in the protein folding/unfolding). 

The dihedral angle potential arises due to the rotation of three successive peptide bonds 
connecting four successive beads, and is related to the dihedral angle 6 as 

<Ps = 4(1 + cos 8) + ejf(l + cos 3(5), (3) 

where e' s and e'g are independent energy parameters. This potential has three minima, one in the 



trans configuration at 8 = and two others in the gauche configurations at 6 = ± arccos y (3e^ — e' s )/12e'g 
(the former one is lower than the latter two). 

The above structure of potential functions for a heteropolymer was suggested by Honeycutt 
and Thirumalai, 6 while Bryngelson and Wolynes 16,18 used a random energy model and Skolnik and 
co-workers 27-29 developed discrete analogs (for a diamond lattice) of eqs.(l) and (3) augmented 
with a "cooperativity potential" as a crucial element of the model. It was shown 6 ' 14 by MD 
simulations (employing low friction Langevin dynamics) that a proper balance between the above 
three contributions to the total energy of the heteropolymer ensures that the heteropolymer folds 
into a well defined /3-barrel structure. The balancing between these terms is performed by adjusting 
the energy parameters e&, q, e n , e' s , e' s for each type of beads. It was also found 6 ' 14 that the balance 
between the dihedral angle potential, which tends to stretch the molecule into a state with all 
bonds in a trans configuration, and the attractive hydrophobic potential is crucial to induce 
folding int a /3-barrel like structure upon cooling. Excessively dominant attractive forces make the 
heteropolymer fold into a globule-like structure, while an overwhelming dihedral angle potential 
makes the chain remain in an unfolded (elongated) state (even at low temperatures) with bonds 
mainly in the trans configuration. 

A possibility that the nucleation-like mechanism can constitute the most viable pathway for the 
protein folding was first suggested by Guo and Thirumalai 14 (although Bringelson and Wolynes 18 
also drew the analogy between the results presented therein and the thermodynamics of cluster 
formation in the framework of CNT). The formalism of the nucleation mechanism for the protein 
folding is invoked to evaluate the size of the critical cluster (nucleus) of native protein residues 
(whereof the formation leads to a quick transition of the whole protein into its native state). 



4 



Denoting the total number of residues in the protein by No, let us consider a formation of a 
cluster having a correct tertiary structure in an unfolded protein. The free energy of formation 
of such a cluster of v native residues (i.e., residues which are in the same state as they are in the 
native protein) can be written (in the framework of CNT) as 

W = -i/A/i + (t4ttAV /3 , (4) 

where 5(i = fid — fx n is the difference between the free energy per residue in the denatured and 
native states, respectively (marked with the subscripts "d" and "n"), a is the "surface" tension 
(energy) of the boundary between the cluster (having a native structure) and the unfolded part 
of the protein, A = (3w/47r) 1 ^ 3 , and v is a volume per protein residue in its native state. 

In ref.14 it was argued that the initial stage of the protein folding is driven by a hydrophobic 
attractive forces so that the volume term (i.e., the first one) in eq.(4) was determined by the 
number of hydrophobic contacts in the cluster and hence could be specified as — (l/2)ebX l '( K X l/ — 1)) 
where x is the mole fraction of hydrophobic residues in the cluster (assumed the same as in the 
whole protein). As a result the number of residues in the critical cluster was given as v c = 
(87ro"A 2 /3x 2 efe)'-3/4) which for typical values of X,a, and 65 was estimated 14 to be of the order 
of 10. In ref.16, A/i in the volume term of eq.(4) was evaluated to be of the order of O.lksT 
{Ub is the Boltzmann constant, and T is the temperature). The "surface" tension was argued 
to arise because the amino acid residues located at the cluster surface interact stronger with the 
cluster interior than with the unfolded part of the protein. Since the interaction energies in protein 
folding are of the order of ksT, the surface tension a multiplied by 47rA 2 was estimated to be 
of the same order and the number of residues in the critical cluster was evaluated 18 to be of the 
order of 100 (for Nq = 150). Both estimates corroborate the idea that the nucleation mechanism 
can constitute a viable pathway for the protein folding. 31-35 



3 Kinetics of nucleation during protein folding 

The CNT-based model for the nucleation mechanism of protein folding is limited to its thermo- 
dynamics, namely to the free energy of formation of the cluster of native residues. For a system 
in the thermodynamic limit (both the number of molecules N — > 00 and the volume V — > 00), the 
validity of expression (4) for the free energy of cluster formation in various [i.e., canonical (NVT), 
grand canonical (/xVT), and Gibbs (NPT)] ensembles was well established. 36 ' 37 If nucleation oc- 
curs in a finite size system, there appear additional terms on the RHS of eq.(4) which depend not 
only on the size of the system but also on the nature of the ensemble. However, a folding protein 
(mostly containing much less than a couple of thousands of amino acids) can hardly be considered 
to satisfy the thermodynamic limit. Furthermore, the cluster formation during protein folding 
occurs under conditions which cannot be identified with either of commonly used thermodynamic 
ensembles. Besides, the CNT based model (described above) has inherited a complicated problem 
of CNT related to the surface tension of the cluster. It was argued that the concept of surface 
tension may not be adequate for too small clusters (such as those of interest in nucleation), 38,39 
not to mention the assumption (of CNT) that it is equal to the surface tension of a planar in- 
terface. Although CNT produces reasonable agreement with experiment on unary nucleation, its 
application to multicomponent nucleation leads to several inconsistencies and large discrepancies 
with experimental data 40-45 which are blamed on the inadequate use of the concept of surface 
tension. In the case of protein folding this problem is even more complicated because a in eqs.(4) is 



5 



an ill-defined quantity which is experimentally impossible to determine due to the non-existence 
of bulk "folded protein" and "unfolded protein" as real physical phases, not to mention a flat 
interface between them. 

In order to avoid the use of macroscopic thermodynamics in the kinetic theory of unary nu- 
cleation, an alternative approach was proposed 20-22 on the basis of the mean first passage time 
analysis. Unlike CNT, that theory 20-22 is built upon molecular interactions and does not make 
use of the free energy of formation of tiny clusters involved in nucleation. Instead, the theory 20-22 
exploits the fact that one can derive and solve the kinetic equation of nucleation (hence find the 
nucleation rate) if the emission and absorption rates of a cluster are known as functions of its 
size. For the rate of absorption of molecules by the cluster, the new approach uses (as CNT does) 
a standard gas-kinetic expression, 46 but the rate of emission of molecules by the cluster is deter- 
mined via a mean first passage time analysis. This time is calculated by solving a single-molecule 
master equation for the probability distribution function of a surface layer molecule moving in a 
potential well around the cluster. The master equation is a Fokker-Planck equation in the phase 
space which can be reduced to the Smoluchowski equation owing to the hierarchy of characteristic 
time scales in the evolution of the single-molecule distribution function with respect to coordi- 
nates and momenta. 20-22 Recently, a further development of that kinetic theory was proposed 
by combining it with the density functional theory (DFT) 23,24 and extending it to binary 24 and 
heterogeneous 25 systems. 

Note that although the emission rate of the cluster in refs. 20-25 was found by using a first pas- 
sage time analysis, for the absorption rate there was used an expression derived in the framework of 
the gas-kinetic theory of gases 46 which assumes a Maxwellian distribution of velocities of mother 
phase molecules. This assumption being unquestionably valid for vapor-to-liquid nucleation in 
dilute (if not ideal) gases, becomes increasingly inaccurate as the density of the mother phase in- 
creases and molecular interactions therein become non-negligible. Clearly, this assumption (hence 
the absorption rate based thereupon) is inadequate in considering the cluster formation during 
the protein folding. Indeed, the amino acid residues of the protein are all successively linked by 
bonds of virtually fixed length each and fixed angle between each pair. 

In this section we will present a new, kinetic model for the nucleation mechanism of protein 
folding based on the first passage analysis which will be used for determining not only the rate 
of emission (of native residues from the cluster) but also the rate of absorption (of non-native 
residues by the cluster). The general formalism of our model is a mean first passage time analysis, 
but a crucial modification (compared to refs. 20-25) must be introduced thereto in order to make 
it applicable to nucleation in a protein. This modification concerns the potential well generated 
around the cluster as a result of all its interactions with a residue which moves around the cluster 
while being a part of the protein backbone (a bead in a heteropolymer) . 

3.1 Potential well around a cluster within a protein 

A heteropolymer chain as a protein model, originally proposed in refs. 6, 14 and described above, 
consists of three types of beads - neutral, hydrophobic, and hydrophilic. The neutral beads play 
an important role in that model. Their interaction with each other is purely repulsive and the 
dihedral angle forces are assumed to be weaker for the bonds involving them so that the bend 
formation is enhanced in regions where they are present. In real proteins this kind of residues 
can be thought to ensure the formation of loops and turns. MD simulations 6 ' 14 show that such 
a heteropolymer acquires a /3-barrel shape in the lowest energy structure, with neutral residues 
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appearing mostly in bend regions. This work is not aimed at obtaining a /3-barrel structure of 
the folded protein, so neutral beads will be removed from the model. Clearly, this will require to 
rebalance e's in eqs.(l),(3) in order to facilitate the formation of loops and turns in a heteropolymer 
chain. 

Thus, the two-component heteropolymer chain as a model for a protein consists of only hy- 
drophobic and hydrophilic beads without neutral ones with the pair interaction, bond angle, and 
dihedral angle potentials given by eqs.(l)-(3). With this assumption, the formation of a cluster 
consisting of native residues during the protein folding can be regarded as binary nucleation. We 
shall therefore present a model for the nucleation mechanism of protein folding in terms of binary 
nucleation by using a first passage time analysis 20-25 with a crucial modification concerning the 
potential well around the cluster. 

Consider a binary cluster of spherical shape (with sharp boundaries and radius R) immersed 
in a binary fluid mixture. 24 A molecule of component i {i = b, I) located in the surface layer of 
the cluster was considered to perform thermal chaotic motion in a spherically symmetric potential 
well 4>i{r) resulting from the pair interactions of this molecule with those in the cluster. Assuming 
pairwise additivity of the intermolecular interactions, 4>i{r) is provided by 



Here r is the coordinate of the surface molecule i, Pj(r) (j = 1,2) is the number density of 
molecules of component j at point r' (spherical symmetry is assumed, the cluster center chosen 
as the origin of the coordinate system), and <f>ij(\r' — r|) is the interaction potential between two 
molecules of components i and j at points r and r', respectively. The integration in eq.(5) goes 
over the whole volume of the system, but the vapor phase contribution can be assumed to be 
small and accounted for by a particular choice of the ej, and e;. 

For nucleation in proteins the potential ipi( r ) f° r a residue of type i around the cluster is 
determined not only by the potential (f>i(r), but also by two other contributions, <f>p(r) and (f>s(r), 
due to the bond angle and dihedral angle potentials, respectively: 



Without affecting the generality of the model, one can significantly simplify the algebra and 
eventual numerical calculations by assuming that all bond angles are fixed and equal to (3q = 105°. 
Under this assumption the contribution to the potential energy of the protein arising from the 
bond angle potential is constant and does not depend on the distance r between the selected bead 
and the center of the cluster. Therefore, the term 0/j(r) on the RHS of eq.(6) can be disregarded 
(or, equivalently, be chosen as a reference level for the potential energy), i.e., 



The term 4>' s (r) = (f)' s (r, r2, rs, r4, r^, re, r?) in Y>j(r) is due to the dihedral angle potential of the 
whole protein. Consider bead 1 (of type b or I) at a distance r from the center of the cluster (see 
Figure 1). The total dihedral angle potential of the whole protein chain for a given configuration 
of beads 2,3,. ..,N can be written in the form 




(5) 



A(r) = 4>i{r) + <j>p(r) + <j) S (r). 



A(r) = 4>i{r) + 4>&{r)- 



(6) 



(7) 
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where 5^ is a dihedral angle between two planes, one of which is determined by beads k 
and the other by beads j, k, I. On the RHS of the above equation, an independent of r term is 
omitted which represents the contributions from the dihedral angles involving beads 8, 9, ...,7V 
(hence <f>' s (r) does not depend on coordinates of beads 8, ...,Nq). It can be regarded as affecting 
only the reference level for Y>j(r). 

Consider bead 1 at a given distance from the cluster r. Various configurations of beads 
2, 3, N (subject to the fixed bond length and bond angle constraints as well as to the constraint 
of excluded cluster volume) lead to various sets of dihedral angles. However, variations in the 
location of beads 8, 9, ...,7V lead to variations in the dihedral potential which are independent 
of r. Thus, the dihedral term (j>$(r) (less an independent of r term omitted hereafter) on the 
RHS of eq.(7) can be obtained by averaging eq.(8) with the probability distribution function 
p(r, r 2 , r3, r4, rs, re, Yj) for configurations of beads 2 to 7 with a fixed location of bead 1 and 
assigning the result to the latter: 

4>s(r) = dr 2 dr 3 dr 4 dr 5 dr 6 dr 7 <^(r)p(r,r 2 , r 3 , r 4 , r 5 ,r 6 ,r 7 ) (8) 

Jn 18 

where n (i = 2,..., 7) is the radius-vector of bead i and Q,is is the integration region in an 
18-dimensional space. 

It is convenient to choose a Cartesian system of coordinates with the origin in the cluster 
center in such a way that the coordinates of bead 1 are x\ = 0,yi = 0, z\ = r (see Figure 1). The 
Cartesian coordinates of other beads will be denoted by Xi,yi,z% (i = 2, ...,7). The Cartesian 
coordinates of bead 2 are related to its spherical ones r 2 , 02, f2 by 

X2 = f2 sin 02 cos ip, ?/2 = f2 sin 02 sin <p, z 2 = r 2 cos ©2- (9) 

At a given r (the location of bead 1 is fixed), the polar angle of bead 2 is uniquely determined 
by r 2 due to the constant bond length constraint: 

2 = 2 (r, r 2 ) = arccos[(r 2 + r\ - r/ 2 )/(2r 2 r)] (0 < 2 < vr), (10) 

whereas the azimuthal angle < 4>2 < 2tt. The distance varies in the range 

r2mm <r 2 <r + T], (11) 

where r 2min = max(R, r — rj). Thus the integration with respect to r 2 in eqs.(8) reduces to 
integration with respect to ipi and r 2 with fixed 2 = 2 - 

For given locations of beads 1 and 2, the possible locations of beads 3 and 4 lie on circles of 
radius tq = rj sin [3q with their location and orientation completely determined by the coordinates 
of beads 1 and 2. This is due to the constraints that all bond angles are equal to (3q and all bond 
lengths are equal to rj. Due to the same constraints, if the locations of beads 2 and 4 are given, 
the possible locations of bead 6 lie on a circle of radius tq, whereof the location and orientation 
are completely determined by the coordinates of beads 2 and 4. Further, for the given locations 
of beads 1 and 3, the possible locations of bead 5 lie on a circle of radius ro with the position 
and orientation completely determined by the coordinates of beads 1 and 3. Finally, for the given 
locations of beads 3 and 5, the possible locations of bead 7 are on a circle of radius ro, with the 
location and orientation completely determined by the coordinates of beads 3 and 5. 

Let us consider bead s with unknown coordinates and two other beads, c and n (closest to 
and next to the closest to bead s) with known coordinates x c ,y c ,z c and x n ,y n ,z n . For example, 
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if s = 7, then c = 5, n = 3; if s = 4, then c = 2, n = 1. Bead s lies on a circle of radius tq with the 
coordinates of the center 

. \ ( \ r/(l + | cos Bo|) , s 

Xq = xo(x n , y n , z n , x c , y c , z c ) = x n + (x c x n ) , === === , (12) 

V(x c - x n ) z + (y c - y n ) z + (z c - z n y 

- ( \ , < \ r/(l + |cos9 |) . , 

2/0 = 2/ (x n ,?/ n ,z n ,x c ,y c ,z c ) = y„ + (y c - y n ) , (13) 

V(x c - x n )^ + (y c - y n ) z + (z c - z n ) z 



77(1 + I COS Go 
V(x c - x n )^ + (y c - y n )' 2 + (z c - z n ) 
The coordinate x s of bead s can change in the range 



^0 — ZQ[^x n , y n , z n , x c , y c , z c ) — z n ■+■ z ra j ^ ^ — . — ? ^ — . — 7 14^7 



- z& < £s < £ c + x b (s = 3, ...,7), (15) 

where 



_ / \ • n / ((y c -yo) 2 + ^c-^o) 2 ) n ^ 

x 6 = x b (x n ,y n ,2: n ,x c ,y c ,2: c ) = 77 sin 60^/7 To— ~, yTTl \2- ( 16 ) 

V (x c - x n ) 2 + (y c - y n ) 2 + (z c - z n y 

For a given x s , the coordinate y s of bead s can have only one of two values, 
= yf{ x s,x n ,yn,Zn,x c ,y c ,z c )=y ^ - T — -2 ± (17) 

(y c - yor + (z c - z y 

\z c - z \^[(y c - yo) 2 + (z c - z ) 2 ]rj 2 sin 2 (3 Q - [{x c - x n ) 2 + (y c - y n ) 2 + (z c - z n ) 2 ](x s - x ) 2 

(y c - yo) 2 + (z c ~ z ) 2 
For given x s and y s , the coordinate z s of bead s can have only a single value 

± = / ± n_ (gc ~ X ){X S - X ) + (jfe - 7/q) (7/ s - j/p) 

— ^sv^s; 2/s j Vni Zni Xci Z c ) — Zq . (4°J 

Zc Zq 

Thus, the probability distribution function p(r, r 2 , r 3 , r 4 , r 5 , r 6 , r 7 ) acquires the form 

p(r,r 2 ,r 3 ,r 4 ,r 5 ,r 6 ,r 7 ) = / _1 <5(0 2 - 8 2 )nJ =3 [%j - y~) - yt)]5(zi - z { ) x 

exp[-^(r, (r,r 2 ,r3,r4,r 5 ,r 6 ,r 7 )], (19) 

where / is a normalization constant determined by the condition 

/ dr 2 dr 3 dr 4: dr 5 dr 6 dr 7 p(r, r 2 , r 3 , r 4 , r 5 , r 6 , r 7 ) = 1. (20) 
J ilia 

Substituting eq.(19) into eq.(8) reduces an 18- fold integral to a 7-fold one: 

<M r ) = / 1 ^ / / dr 2 / dx 3 / dx^ / dx 5 / dx 6 / dx 7 x 

i J ,fc,m^=+,-- /o Jl I Jl I Jl I Jl T Jl? 

r\ sin 6 2 (r, r 2 ) (f>ijkmn(<P2, r 2 , x 3 , x 7 ) exp[-^ femn ((^ 2 , r 2 , x 3 , x 7 )/k B T], (21) 
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/ = } / dif2 / dr2 / dx 3 / dx4 / dx$ / / dx 7 x 

■ ■ , . JO JL 2 J L% J L 3 . JVi JL™ JL" 

i,j,k,m,n=+,— 3 4 5 6 7 

rf sin 6 2 (r,r 2 ) exp[-0 ijfc mn(^2, r 2 , x 3 , x 7 )/k B T]. (22) 

In these equations each of the summation indices takes on two values, + and — , so that there are 
5 2 terms in the sum differing by the integrand as well as by the integration ranges (except for L 2 
which is independent of i,j, k, m, n). 

The double inequalities (11) and (15) and eqs.(16)-(18) completely determine the integration 
ranges (subject to the additionl constraint x 2 s + y 2 + z 2 > R 2 (s = 2, ...,7) of excluded cluster 
volume) in eqs.(21) and (22). In order to transform the function <pg(r) = 4>'g(r, r 2 , r 3 , r 4 , r 5, r 6> r 7) 
into the function (pijkmn(r, <^ 2 , r 2 , a; 3 , x 7 ), the variables y s (s = 3, 7) and z s (s = 3, 7) in 
the former must be replaced by yf and zf in all the possible combinations each of which gives rise 
to a term in the sums in eqs.(21) and (22) (note that zf = z s (x s ,yf , ...) and z~ = z s (x s ,y~ , ...)). 
The polar angle of bead 2 must be replaced by 2 = 2 (r, r 2 ). 

For example, consider the term with i = +, j = — , k = +, m = +, n = — in the sum in 

eqs.(21),(22). The corresponding integrand <f> + hH (<£> 2 , r 2 , X3, X7) is obtained from 

<j)s(r, r 2 ,r3,r4,r5,r 6 ,r 7 ) as follows: 

y 3 must be replaced by y^ = y£(x 3 ,x 2 ,y2,Z2,xi,yi,zi) and z 3 by zf = z 3 (x 3 , yf, x 2 , y 2 , z 2 , Xi, yi, zi), 
y 4 must be replaced by 2/4 = yj(a; 4 , , Xi, yi, z\, x 2 , y 2 , z 2 ) and z 4 by z^ = z 4 (x 4 , , x ll y 1 , zi, x 2 , y 2 , z 2 ), 
y 5 must be replaced byyf = yf(x 5 ,x 1 ,y 1 ,z 1 ,x 3 ,y 3 ,z 3 ) and z 5 by = z 5 (x 5 ,yf ,x 1 ,y 1 , Z!,x 3 ,y 3 , z 3 ), 
y 6 must be replaced by yf = yf{x & , x 2 , y 2 , z 2 , x 4 , y 4 , z 4 ) and z 6 by = z 6 (x 6 , , x 2 , y 2 , ^2,^4, 2/4, 24), 
y 7 must be replaced by yf = yf (x 7 , x 3 , y 3 , z 3 , x 5 , y 5 , z 5 ) and z 7 by zf = z 7 (x 7 , yf , x 3 , y 3 , z 3 , x 5 , y 5 , z 5 ), 

Figure 2 presents 4>s(r), the contribution from the average dihedral potential to the total po- 
tential around the cluster, provided by eqs.(21),(22). It has quite a remarkable behavior. Starting 
with its maximum value at the cluster surface, it monotonically decreases with increasing r until 
it becomes constant for some r > r (see section 4). This behaviour can be accounted for by the 
entropic effect on the average dihedral potential assigned to a selected bead. Actually, the closer 
the selected bead (1) is to the cluster surface (for r < f), the more restricted is the configura- 
tional space available for the neighboring beads (2 through 7). This decreases the entropy of the 
heteropolymer chain compared to the case where bead is far enough away from the cluster which, 
in turn transpires as an increase in the average dihedral potential (f>s(r), assigned to bead 1, with 
decreasing r. With some degree of liberty, <j>s(r) can be interpreted as a constrained (with bead 
1 fixed) free energy of the heteropolymer chain 1 through 7. 

3.2 Determination of the emission and absorption rates 

Figure 3 presents typical shapes of the constituents <pi(r) and <fis(r) of the potential well as 
functions of the distance from the cluster center, as well as the overall potential well tpi(r) itself (for 
details of numerical calculations see section 4). The contribution 4>i(r), arising from the pairwise 
interactions, has a familiar form 20-25 reminiscent of the underlying Lennard-Jones potential. Its 
combination with the contribution (j)s(r) from the average dihedral potential results in the overall 
potential ipi(r) which has a double well shape: the inner well is separated by the potential barrier 
from the outer well. This shape of tpi( r ) is °f crucial importance to our model for the nucleation 
mechanism of protein folding because it makes it possible to use a mean first passage time analysis 
for the determination of the rate of absorption of beads by the cluster 

Developing our model in the spirit of the mean first passage time analysis, 20-25 a bead (residue) 
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is considered as belonging to the cluster as long as it remains in the inner potential well (hereinafter 
referred to as "i.p.w."), and as dissociated from the cluster w hen it passes over the barrier between 
the i.p.w. and the outer potential well (hereinafter referred to as "o.p.w."). The rate of emission, 
W~ , is determined by the mean time necessary for the passage of the bead from the i.p.w. over 
the barrier into the o.p.w. Likewise, a bead is considered as belonging to the unfolded part of 
the heteropolymer (protein) as long as it remains in the o.p.w., and as absorbed by the cluster 
when it passes over the barrier between the o.p.w. and the i.p.w.. The rate of emission, W + , 
is determined by the mean time necessary for the passage of the bead from the o.p.w. over the 
barrier into the i.p.w. 

The mean first passage time of a bead escaping from some potential well is calculated on 
the basis of a kinetic equation governing the chaotic motion of the bead in that potential well. 
The chaotic motion of the bead is assumed to be governed by the Fokker-Planck equation for the 
single-particle distribution function with respect to its coordinates and momenta, i.e., in the phase 
space. 47 " 49 Prior to the passage event, the evolution of a bead in both the i.p.w. and o.p.w. occurs 
in a dense enough medium (cluster folded residues or unfolded but compact part of the protein), 
where the relaxation time for its velocity distribution function is extremely short and negligible 
compared to the characteristic time scale of the passage process. Under favorable conditions, 
the Fokker-Planck equation reduces to the Smoluchowski equation, which involves diffusion in an 
external field. 48 ' 49 Solving that equation,one can obtain 24 the following expressions for W~ and 
W + , the emission and absorption rates, respectively: 

W ~ = n iwAw w iw> W+ = n ow D ow uj ow . (23) 

Here the subscripts "iw" and "ow" mark the quantities for the inner and outer potential wells, 
respectively; n is the number of beads in the well, D is the diffusion coefficient of the bead, and 
uj- 1w and u> ow are defined as 

u iw = VAw^iw w »w = 1/£>owT w, (24) 

where f; w and fj w are the mean first passage times for a bead in the i.p.w. to cross over the barrier 
into the o.p.w. and vice-versa, respectively. Note that in the original work 20-22 on the mean first 
passage time analysis in the nucleation theory and its recent development 23-25 this method was 
used only for determining W~ but not W + . In the present model, the double well character of 
the overall potential ip(r) around the cluster allows one to apply the first passage time analysis 
also to determining W + . 

Clearly, the quantities W~, W + , to- lw , w lw , f[ w , f[ w are the functions of the cluster size and 
composition (for the explicit form see ref.24). However, since the overall composition of the protein 
(heteropolymer) is fixed, one can assume that the cluster which forms during its folding has a 
constant composition equal to the overall protein composition which leads to a unary nucleation 
theory. (Strictly speaking, one can develop a theoretical model for the nucleation mechanism 
of protein folding without this assumption which would lead to a binary nucleation theory, but 
this would drastically complicate the problem computationally.) Under this assumptions the 
aforementioned quantities are functions of only the size of the cluster (say, its radius R or the 
total number of beads v therein). 

3.3 The equilibrium distribution and steady-state nucleation rate 

In terms of nucleation, during the protein folding clusters of various sizes may emerge and exist 
simultaneously with different probabilities. Let us denote the distribution of clusters with respect 
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to the number of beads in a cluster at time t by g(v,t). Once the emission and absorption rates 
W~ = W~(v) and W + = W + (u) are known as functions of the cluster size, one can find the 
equilibrium distribution of clusters g e (v, t) and solve the kinetic equation of nucleation to find the 
steady-state nucleation rate. 

Actually, according to the principle of detailed balance, W + {u — l)g e {v — 1) = W~ {y)g e {y), 
which can be rewritten as 

g e (u) W + (u-l) 



By applying eq.(25) to [y 
equalities, one obtains 



9e{v - 1) 
with i = 2,3,. 



w-m ■ (25) 

, v — 1, multiplying the RHSs and LHSs of all 



9eiy) 



v-1 



n 



W+{v 



\ w-(u-i + iy 



(26) 



The equilibrium distribution of clusters v = 1 is just the number density of residues in a compact 
(but unfolded) protein, i.e.,g e (l) = p u , so that equation (18) can be rewritten as 



W+{l) ^ W+(u-i + l) 
9e{v) - Pu w+{u) LI w - {y _ i + l y 



(27) 



Let us introduce the function G{y) = — kBTln[g e (v)/p u ]. Clearly, G{y) in the present theory 
plays a role similar to the the free energy of cluster formation in CNT 50-52 Under favorable 
conditions, G{y) first increases with increasing v, attains its maximum at some v = f c , and 
then decreases. In CNT, G{v) has also a minimum following the maximum but only for an 
NVT ensemble where the growth of the cluster (i.e., increase of u) leads to the decrease in the 
metastability of the mother phase. Since the folding protein cannot be considered as an NVT 
ensemble, we will not consider this case. The essence of our model, as an alternative to the 
CA-based theory, consists of constructing the equilibrium distribution of clusters, g e iy} (and the 
function G{u) without employing the classical thermodynamics. 

The kinetic equation of nucleation in the vicinity of the critical point can be written as 50-52 



dg(v,t) 
dt 



W7 



d_ 

du 



d_ dG 

du dv 



g(v, t). 



(28) 



(subscript "c" marks quantities at the critical point) and the function G{u) can be accurately 
represented by its bilinear form. The steady-state solution of the kinetic equation (20) in the 
vicinity of v c subject to the conventional boundary conditions 



1 (i/-0), 



9e{v) 



[y — > co), 



(29) 



where g e {v) is the equilibrium distribution, provides the steady-state nucleation rate 50 52 which 
can be presented in the form 

W+ 



J, 



where 



d 2 G 



dl 



-G c /kgT 



-1/2 



(30) 
(31) 
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3.4 Evaluation of the protein folding time 



Knowing the emission and absorption rates as functions of u as well as the nucleation rate J s , 
one can estimate the time tf necessary for the protein to fold via nucleation. To do so we will 
regard the protein folding (via nucleation) as a two stage process. At the first stage, a critical 
cluster of native residues form (nucleation proper). At this stage, i.e., for v < v c , the emission 
rate W~ is larger than W + , but the cluster does attain the critical size by means of fluctuations. 
At the second stage the nucleus grows via regular absorption of native residues dominating their 
emission, W~ < W + for v > v c . Thus, the folding time can be represented as 

t f ^t n + t g , (32) 

where t n is the time necessary for one critical cluster to nucleate within a compact (but still 
unfolded) protein and t g is time necessary for the nucleus to grow up to the maximum size, i.e., 
attain the size of a folded protein. 

The time t n of the first nucleation event can be estimated as 

tn ~ 1/JsVo, (33) 

where Vo is the volume of the unfolded protein in a compact state. The growth time t g can be 
found by solving the differential equation 

^ = W+(u)-W-(u) (34) 

subject to the initial condition v = v c at t = and the condition v = Nq at t = t n . The solution 
of eq.(34) is given by the integral 

[ N ° dv 

tn ^i c W+{ V )-W-[y)' (35) 



4 Numerical evaluations 

In this section we will present some numerical results of the application of our model to the folding 
of a model protein, namely, a heteropolymer consisting of total 2500 hydrophobic and hydrophilic 
residues, with the mole fraction of hydrophobic residues %o = 0.75. The interactions between a 
pair of non-linked beads were modeled via the Lennard-Jones (LJ) type potentials (1), while the 
potential due to the dihedral angle 5 was modeled according to eq.(3). The presence of water 
molecules was not taken into account explicitly but was assumed to be implemented into the 
model via the potential parameter. 

All numerical calculations were carried out for the following values of the interaction parame- 
ters: 

771 = 5.39 x 10~ 8 cm, ei = (2/700)e fe , e' s = eg = 0.3e 6 , e b /kT = 1. 

A typical density of the the folded protein was evaluated according to data in refs.53,54 and was 
set to pf>] 3 = 1.05, while a typical density of the unfolded protein in the compact configuration 
was set to be p u = 0.25/?j (note that similar values for pj and pd are suggested in ref.18). Taking 
into account the results in ref.55, the diffusion coefficients in the i.p.w. and the o.p.w. were 
assumed to be related as AwP/ = D a „p u . Because of the lack of reliable data on the diffusion 
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coefficient of a residue in a protein chain, D iw was assumed to vary between 10 -6 cm 2 /s and 10 -8 
cm 2 /s. 

Figure 2 shows the average dihedral potential (assigned to a selected bead) (f> s (r) as a function 
of r for three clusters of sizes (a) R = 3r/, (b) R = 6r], and (c) R = 9rj. The points represent the 
actual numerical results obtained by using 1 x 10 6 to 2 x 10 6 point Monte Carlo integration in 
calculating 7-fold integrals in eqs.(ll),(12). The vertical dashed lines correspond to r' (see above) 
such that 4> 5 (r) is expected to be constant for r > r' . The solid lines are analytic fits by an 
expression a + 6exp[— c(r — d) 2 ]. With the accuracy of our calculations the parameters a, b, and c 
of this fit do not change with R, while the parameter d is roughly R+rj. Clearly, with an increased 
accuracy of calculations we may eventually find some dependence of a, b, c on R, but with our 
current accuracy it appears that 4> s (r) has an universal shape independent on R. Undoubtedly, 
the calculation of (p s (r) will constitute the most time consuming procedure in applying our model 
to real life problems. It takes about 24 hours to obtain one value of 4> s (r) (one point in Fig.2) on 
a Dell/Pentium4/3Ghz/512Mb computer. 

Figure 3 presents typical shapes of the potentials (f>b(r) (lower solid curve), 4> s (r) (upper solid 
curve), and ipb (r) (dashed curve) for a hydrophobic bead around the cluster as functions of the 
distance r from the center of the cluster of radius R = 3t]. The potential <fo(r) is due to the 
pairwise interactions of the Lennard-Jones type, and has a shape reminiscent thereof. Previous 
applications 20-25 of the mean first passage time analysis to nucleation had invariably lead to this 
kind of the potential well around the cluster. 

The average dihedral potential (assigned to a selected bead) 4> s (r) has a maximum value at 
the cluster surface and decreases monotonically with increasing r until it becomes constant for 
r > f which is the maximum distance between beads 1 and 6 (or beads 1 and 7) dependent on 
R, rj, and Go: 

r = R + 7] ^1 + ^3 - 2 cos G + 2^2(1 - cos 6 ) sin . 

Such a behavior of 4> s (r) can be thought of as a consequence of an increase in the entropy of 
the heteropolymer chain as the selected bead 1 approaches the cluster surface for r < r' which 
occurs because the configurational space available for the neighboring beads (2 through 7) becomes 
more and more restricted. Once r becomes greater than r', this piece of the heteropolymer chain 
(beads 1 through 7) does not feel the presence of the cluster any more. With some degree of 
liberty 4>s(r) can be interpreted as a constrained (with bead 1 fixed) free energy of that piece of 
the heteropolymer which includes beads 1 through 7. 

As a result of the combination of <j>b(r) and <fo(r) the overall potential ipb{f) has a double well 
shape: the inner well is separated by the potential barrier from the outer well. The geometric 
characteristics of the wells (widths, depths, etc..) and the height and location of the barrier 
between them are determined by the interaction parameters e&, e/, eg. For example, the larger the 
ratio es/ € b, the higher the barrier between the well, the wider the i.p.w. and the narrower the 
o.p.w. Note that the barrier has different heights for beads in the i.p.w. and o.p.w. The outer 
boundary of the o.p.w. is due to the confining potential arising because all residues around the 
cluster are successively linked and are bound thereto. Hence they are confined within some volume 
wherein the protein is encompassed and the location r cf of the confining potential is assumed to 
coincide with its outer boundary. For a given No the location r cf is determined by the the size 
of the cluster and densities pf and p u . The existence of the o.p.w. allows one to consider the 
absorption of a bead by the cluster as an escape of the bead from the o.p.w. by crossing over the 
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barrier into the i.p.w. This makes it possible to use the mean first passage time analysis for the 
determination of the rate of absorption of beads by the cluster. Since the use of the traditional 
expression for the absorption rate (based on the gas- kinetic theory) 46 is rather inadequate in the 
cluster growth within the protein, the double- well shape of ipi(r) is of crucial importance to our 
model for the nucleation mechanism of protein folding. 

Figure 4 presents W~ and W + , the emission and absorption rates, respectively, as functions 
of the cluster size R. The location of the intersection of these functions determines the size of the 
critical cluster, R c . The emission rate is greater than the absorption rate, W~{r) > W + (r), for 
small clusters with R < R c , whereas for clusters larger than the nucleus the absorption dominates 
over the emission, W~(r) < W + (r) for R > R c . Note that both W~ and W + increase with 
increasing R, but W~ increases roughly linearly with R whereas W + shoots up by several orders 
of magnitude after the cluster becomes supercritical. This is a consequence of the fact that the 
width of the o.p.w. quickly decreases as the cluster grows while the outer height of the barrier 
between the i.p.w. and o.p.w. does not not virtually change, and it becomes increasingly easy for 
a bead which is in the o.p.w. to cross over the barrier and fall into the i.p.w. 

The behaviour of W~ and W + also explains our numerical estimates for the characteristic 
times of the first nucleation event t n , growth time t g , and total folding time tf by eqs.(32),(33), 
and (35). Although these depend very much on the location of R c and the value of D iw (R c itself 
does not depend on D iv but only on the ratio D ow /Aw), always t n » t g , i.e., the protein folding 
time is mainly determined by the time necessary for the first nucleation event. Physically, this 
is the case because the increase of the cluster size from v = 1 to v = v c occurs only owing to 
fluctuations which have to overcome the natural tendency of a small cluster to decay (W + < W~ 
for v < v c ). For supercritical clusters W + so quickly immensely overwhelms W~ that fluctuations 
are unable to impede the natural tendency of the cluster to grow (strictly speaking, this is true 
only for v > v c + Azv c , but for rough estimates eq.(35) is acceptable). For the above choice 
of system parameter and D iw in the range from 1CP 6 cm 2 /s to 10~ 8 cm 2 /s our model predicts 
the characteristic time of the protein folding (for iVo = 2500) in the range from several seconds 
to several hundreds of seconds which is in a very good agreement with expectations based on 
experimental data. 

5 Conclusions 

So far most of the work on the protein folding has been done by using either Monte Carlo (MC) or 
molecular dynamics (MD) simulations. The rigorous theoretical treatment of the protein folding 
by means of the statistical mechanics is hardly practicable because of the extreme complexity 
of the system. A number of simulations have suggested that there can exist multiple pathways 
for a protein to fold one of which has been identified as reminiscent of nucleation. However, a 
theoretical model for the nucleation mechanism of the process had so far remained underdeveloped. 
The previous model, based on the approach of the classical nucleation theory (CNT), was a purely 
thermodynamic one considered the formation of a cluster of protein residues and calculated the 
free energy change thereupon. 14 ' 18 The number of a critical cluster (nucleus) was provided by the 
location of the maximum of the free energy of formation as a function of a single independent 
variable of state of the cluster. In such a model the free energy of cluster formation depends on 
the surface tension of a cluster of protein residues. This quantity is an ill-defined physical quantity 
and can be considerer only as an adjustable parameter. According to the nucleation mechanism, 
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after the formation of the nucleus (critical size cluster of residues) , the protein quickly reaches its 
native state. 

In the present work we present a new, microscopic model for the nucleation mechanism of 
the protein folding. A protein is considered as a heteropolymer consisting of two type of beads 
(hydrophobic and hydrophilic) linked with bonds of fixed length. All bond angles are also assumed 
to be fixed and equal to 105°. All non-adjacent beads are assumed to interact via Lennard- 
Jones like potential. Besides these interactions, the total energy of the heteropolymer contains 
a contribution from dihedral angles of all triads of successive links. Unlike the old model, ours 
is developed without recurring to CNT approach. Instead, it is based on the above "molecular" 
interactions, both long-range and configurational. The parameters of these potentials can be 
rigorously defined, unlike the ill-defined surface tension of a cluster of protein residues. 

The crucial idea underlying the new model consists of averaging the dihedral potential in which 
a selected residues is involved over all the possible configurations of neighboring residues. The 
resulting average dihedral potential depends on the distance between the residues and the cluster 
center. It has a maximum at the cluster surface and monotonically decreases with increasing dis- 
tance therefrom. Its combination with the average potential due to pairwise interactions between 
the selected residue and those in the cluster a double potential well around the cluster with a 
barrier between the two wells. Residues in the inner well are considered to belong to the cluster 
(part of the protein with correct tertiary contacts) while those in the outer well are treated as 
belonging to the mother phase (amorphous part of the protein with incorrect tertiary contacts). 
Transitions of residues from the inner well into the outer one and vice versa are considered as 
elementary emission and absorption events, respectively. The rates of these processes are deter- 
mined by using the mean first passage time analysis. Once these rates are found as functions of 
the cluster size, one can develop a self-consistent kinetic theory for the nucleation mechanism of 
protein folding. For example, the size of the critical cluster (nucleus) is then found as the one 
for which these rates are equal. The time necessary for the protein to fold can be evaluated as a 
sum of the times necessary for the appearance of the first nucleus and the time necessary for the 
nucleus to grow to the maximum size (of the folded protein in the native state). 

For numerical illustration we have considered a model protein consisting of 2500 beads with 
the mole fraction of hydrophobic beads equal to 0.75. The composition of the cluster during its 
formation and growth was assumed to be constant and equal to the composition of the whole 
protein. This allows one to consider the model as a single component one. The size of the critical 
cluster and the folding time predicted by the model depend very much on many parameters of 
the system, such as interaction parameters, densities of the protein in the unfolded (but compact) 
and folded states, diffusion coefficients therein, etc. With an appropriate choice of interaction 
parameters and densities, the size of the critical cluster predicted by our model is about 220 
residues and with the free energy of nucleus formation being about 20/csT. This results suggest 
that the quantity equivalent to the "surface tension" in the old model of nucleation in a protein, 
should be smaller than the previous estimates 14 ' 18 of the latter by an order of magnitude. The 
characteristic time of protein folding was estimated to be in the range from several seconds to 
several hundreds of seconds depending on the diffusion coefficient of native residues in the range 
10 -6 cm 2 /s to 10~ 8 cm 2 /s. This is consistent with experimental data 55 on typical folding times of 
proteins as well as with estimates obtained by other theoretical models 18 and in simulations. 6 ' 14 

A further development of our model will require the removal of several simplifying assumptions 
that we recurred to in the present work. For example, it would be more appropriate to model 
a protein as a three-component heteropolymer (including not only hydrophobic and hydrophilic 
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residues, but also neutral ones). This will result in more lengthy numerical calculations of the 
average dihedral potential because the dihedral potential involving neutral beads is expected to 
be lower and requires separate calculations. Next, the cluster composition during its formation 
and growth can quite significantly depend on the cluster size, particularly in the vicinity of the 
critical size, so assuming it constant in the present work might have lead to serious inaccuracy in 
the results. Including neutral beads in the model and allowing the cluster composition to differ 
from that of the protein will result in a binary or even ternary nucleation mechanism of protein 
folding. However, besides some increase in computational efforts there seem to exist no principal 
difficulty in developing the model in these directions. Some other improvements of the model can 
be also introduced in the model, but they will be discussed in our future papers on the subject. 
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Figure 1: A scheme of a piece of a heteroplymer chain around the spherical cluster (shown only partly) of 
radius R. Bead 1 lies in the Fugure plane, whereas beads 2 through 7 may all lie in different planes, but 
all bond angles are equal to 105° and their lengths are equal to rj. The distance between the selected bead 
1 and the center of the cluster is r 
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Figure 2: The average dihedral potential (assigned to a selected bead) <fi s (r) as a function of r for three 
clusters of sizes (a) R = 3i], (b) R = Qr], and (c) R = 9rj. The points represent the actual numerical results 
obtained by using the Monte Carlo integration in eqs.(21),(22). The vertical dashed lines correspond to f . 
The solid lines are analytic fits by an expression a + &exp[— c(r — d) 2 ]. 
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Figure 3: Typical shapes of the potentials <fib(r) (lower solid curve), <j>s{r) (upper solid curve), and ipb{r) 
(dashed curve) for a hydrophobic bead around the cluster as functions of the distance r from the center of 
the cluster of radius R — 3?7. The outer boundary of the o.p.w. (r cf ~ 12.997/) was assumed to coincides 
with the outer boundary of the volume wherein the whole protein is encompassed. 
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Figure 4: Figure 4 presents W~ and W + , the emission and absorption rates, respectively, as functions 
of the cluster size R. The location of the intersection of these functions determines the size of the critical 
cluster, R c 
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